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Abstract 

Matrix factorization from a small number of observed entries has recently garnered much attention as the key ingredient of 
successful recommendation systems. One unresolved problem in this area is how to adapt cuiTent methods to handle changing user 
preferences over time. Recent proposals to address this issue are heuristic in nature and do not fully exploit the time-dependent 
structure of the problem. As a principled and general temporal formulation, we propose a dynamical state space model of matrix 
factorization. Our proposal builds upon probabilistic matrix factorization, a Bayesian model with Gaussian priors. We utilize results 
in state tracking, i.e. the Kalman filter, to provide accurate recommendations in the presence of both process and measurement 
noise. We show how system parameters can be learned via expectation-maximization and provide comparisons to current published 
{^J^ techniques. 

■ Index Terms 

■ collaborative filtering, Kalman filtering, recommendation systems, expectation-maximization, learning 

I. Introduction 

^ 1 , Matrix factorization (MF), the decomposition of a matrix into a product of two simpler matrices, has a long and storied 
^ ■ history in statistics, signal processing, and machine learning for high-dimensional data analysis The approach garnered 
Q , much attention for its successful application to recommendation systems based on collaborative filtering, including the Netflix 
prize problem f2l. Recommendation is of interest in a variety of domains. The most common examples are recommending 
', movies, television shows, or songs that a particular individual will rate highly, but there are many other examples. Business 
^ ■ analytics examples from marketing and salesforce management include recommending products to a salesperson to cross-sell 
00 . and upsell that a particular customer is likely to purchase, and recommending a sales team to a sales manager that will be able 
0^ ■ to successfully sell to a particular business customer 

In these domains, customer preferences often follow a trajectory over time. Customers may be interested in basic products 
at first and then higher-end products later, or products for toddlers first and for adolescents later; customers may need a 
[ sales team with greater relationship-building expertise at first and technical expertise later Additionally, we can distinguish 
recommendation for discovery and recommendation for consumption; new items are recommended in the former whereas the 
same item may be repeatedly recommended in the latter 

The MF approach to collaborative filtering usually includes Frobenius-norm regularization ||2l, which is supported by a 
linear-Gaussian probabilistic model known as probabilistic matrix factorization or PMF |3i. Due to its hnear-Gaussian nature, 

■ PMF lends itself to incorporating temporal trajectories through the state space representation of linear dynamical systems ||4| 
5—1 \ and algorithms for estimation based on the Kalman filter IS), JD. We propose a general recommendation model of this form 
^ ■ and develop an expectation-maximization (EM) algorithm to learn the model parameters from data. The Kalman filter and 

Rauch-Tung-Striebel (RTS) smoother f]\ appear in the expectation step of the EM. 

Several recent works also address dynamic and temporal issues in recommendation. TimeSVD, a component of the Netflix 
prize-winning algorithm, addresses temporal dynamics through a specific parameterization with factors drifting from a central 
time, but unlike our general formulation can only handle limited temporal structure f2|. The probabilistic tensor factorization 
approach does not take temporal causality into account as we do |8J. The formulation of |2J is based on nearest neighbor 
collaborative filtering rather than MF, and is known to have scaling difficulties. The spatiotemporal Kalman filter of ifTOl has 
a limited state evolution and convergence issues, target tracking in recommendation space has no element of collaboration 
and requires prior knowledge of the 'recommendation space' liTTI . and the hidden Markov model for collaborative filtering 
only captures evolution of a known attribute over time among users lfT2l . The contribution of our paper is the development 
of a principled and general MF-based approach to recommendation and predictive analytics, including recommendation for 
consumption, in which the given data samples arrive over time and contain significant time dynamics. 
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II. Probabilistic Matrix Factorization 

Recommendation systems comprise N users and M items, along with some measure of preference represented by a matrix 

e M^^^^. For most practical applications, only a small fraction of the entries of O are observed and are usually corrupted 
by noise, quantization, and different interpretations of the scaling of preferences. In MF, each user and item is represented by 
a row vector of length K denoted Ui and Vj respectively, corresponding to weights of K latent factors. We concatenate the 
factors into matrices U £ R^^^ and V G M^^^^. The preference matrix is then O = UV'^, meaning the preference of user 

1 for item j is 0^ = {ui,Vj), a common assumption in recommendation systems ||2l. 

Under MF, latent factors are learned from past responses of users rather than formulated from known attributes. Factors are 
not necessarily easy to interpret and change dramatically depending on the choice of K. The value of K is an engineering 
decision, balancing the tradeoff of forming a rich model to capture user behavior and being simple enough to prevent overfitting. 

Given K, a common way to learn factor matrices and consequently the complete preference matrix O from limited 
observations is the following program: 

min J2 {o^,-u^vJ)' + X,\\U\\l + X2\\V\\l, (1) 

where the set O contains observed preference entries, and (Ai, A2) are regularization parameters. This program can be solved 
efficiently using stochastic gradient descent and has been experimentally shown to have excellent root mean-square error 
(RMSE) performance ||2|. 

More recently, the regularization of the above program was motivated by assigning Gaussian priors to the factor matrices U 
and V respectively ||3]. Coined PMF, this Bayesian formulation means ([TJ is justified as producing the maximum a posteriori 
(MAP) estimate for this prior. In this case, the regularization parameters Ai and A2 are effectively signal-to-noise ratios (SNR). 
Since O is not a linear function of latent factors, the MAP estimate does not in general produce the best RMSE performance, 
which is the measure commonly desired in recommendation systems. However, wisdom gained from the Netflix Challenge 
and experimental validation from jSJ show that the MAP estimate provides very competitive RMSE performance compared to 
other approximation methods. 

III. State Space Model 

Given the success of MAP estimation in linear-Gaussian PMF models and our interest in capturing time dynamics, we 
propose a linear-Gaussian dynamical state space model of MF whose MAP estimates can be obtained using Kalman filtering. 
We assume that user factors Ui{t) are functions of time and hence states in the state space model, with bold font indicating 
the vector being random. In our proposed model, we have coupled dynamical systems, and to adhere to typical Kalman filter 
notation, we use 4 = Ui{t) to denote the state of user i at time t. 

For each user, the initial state x^o is distributed according to N{^i,Tii), the multivariate Gaussian distribution with mean 
vector and co variance matrix E;. The user-factor evolution is linear according to the generally non-stationary transition 
process An and contains transition process noise t ^ A/'(0, Qi^t) to capture variability of individuals. Taken together, the 
state evolution is described by the set of equations: 

Xj,t = Ai^tXi,t_i + Wi,t i = l,...,iV. (2) 

We assume that item factors evolve very slowly and can be considered constant over the time frame that preferences are 
collected. Also, due to the sparsity of user preference observations, a particular user-item pair at a given time t may not be 
known. Thus, we incorporate the item factors through a non-stationary linear measurement process Hi t which is composed 
of subsets of rows of the item factor matrix V based on item preferences observed at time t by user i. Note that all Hi t are 
subsets of the same fixed V and are coupled in this way. We also include measurement noise z^ t ~ AA(0, Ri,t) in the model. 
The overall observation model is: 

yt^t ^ H^,t^i.t + Zi,t i = l,...,N. (3) 

The product Hi t'x.i^t in © parallels the {ui,Vj) product in Sec. Again adhering to Kalman filter notation, we use j/^ t to 
denote the observations, corresponding to the observed entries of O, now a tensor in r^^x^x?" 

The state space model can be generalized in many different ways that may be relevant to recommendation systems, including 
non-Gaussian priors, nonlinear process transformation and measurement models, and continuous-time dynamics. We focus on 
the linear-Gaussian assumption and defer discussion on extensions to Sec. |VT] 

IV. Factorization via Learning 

In the regularized formulation ([Til, the MF task is learning user and item factors given sparse observations. In our dynamical 
model, MF is composed of two dependent tasks: learning model parameters that govern the motion of states, and performing 
MAP estimation of user factors. 
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Given the model parameters and access to observations for all past times i = 1, . . . ,T, the MAP estimate of U{t) can be 
found using a noncausal Kalman filter called the RTS smoother In the PMF setting, N of these RTS smoothers run in parallel, 
all of which share the same item factor matrix V in the measurement process. We call this architecture collaborative Kalman 
filtering (CKF). Let the Kalman estimates and covariances be defined as: 

^i.t\.s = E [xi.t I yi,i, . . ■,yi,s] (4) 
Pz,t\s = Var(xj,t I yi^i, . . .,yi^s) ■ (5) 
Then, the Kalman filtering equations are as follows: 

^i,t+l\t = ^i,t+l ^i,t\t^ (6) 
Pi,t+l\t = Ai,t+1 Pi,t\t ^I,t+1 + Qi,U (7) 

^i,t\t = ^i,t\t~i + Ki^t {yi,t ~ H^,t Xi,t|t_i) ; (8) 
P^,t\t^P^At-l~K^,tH,,tP,^^\t_^. (9) 

Moreover, the RTS smoothing equations are: 

^i,t\T = ^i,t|t + Ji,t (Xi^t+1|T ^ Si,t+l|t) \ (10) 

Pi,t\T = Pi,t\t + Ji.t {Pi,t+l\T ~ Pi,t+l\t) ^^^^ 

Pi,T,T-l\T = {I - Ki,T Hi,t) Ai^T Pi,T-l\T-l] (12) 

Pi,t,t-l\T = Pi,t\t jTt-l (13) 
+ Ji,t {Pi,t+l,t\T ~ Pi,t\t) jf^t-lJ 

where 

K^,t = P^,t\t-i Hit {H,.t P,^t\t-i Hit + ^^,0"' ; (14) 

Ji,t ~ Pi.t\t AftJf-i Pi/\t~l- (15) 

The CKF steps above fit naturally in the expectation step of the EM algorithm used to learn model parameters such as mean 
and covariance of the initial states, the transition process matrices, the process noise covariances, the measurement process 
matrices, and the measurement noise covariances. In learning the measurement process matrices, we also get an estimate for 
the item factor matrix V, which is the other ingredient in the MP problem. The EM algorithm proceeds by alternating between 
the expectation step in which the expectation of the likelihood of the observed data is evaluated for fixed parameters, and the 
maximization step in which the expected likelihood is maximized with respect to the parameters. 

The model proposed in Sec.|III]is a fully general Gaussian state space model whose parameters could be learned, but would 
require many observation samples. In typical applications however, the observations are sparse and the parameters are heavily 
correlated in time; thus we simplify the model to reduce the number of parameters to learn. First, we make the approximation 
that parameters are independent of time, meaning they do not change during the observation period. Second, we take the initial 
states to be iid A/^(0,(t^) for all users, meaning they are homogeneous enough to share similar scalings of preferences. Last, 
we assume both process and measurement noise are iid, reducing the learning to variances ctq and aj^ respectively. With these 
simplifications, the variance parameters can be chosen to maximize the log-likelihood as follows: 

N 

NK 



r^^'^^iPiMT+^iMT^lMT) (16) 

i=l 
^ NT 

^EEt'^K^MIT+XMiTirtir) (17) 



MKT 

i=l t=l 

- 2 {P^.t,t-l\T + X,,t|T X^t-l|T 1 A' 



1 

W\ 



A [P^,t-l\T + X,,t_i|T Kt-1\T) A' j 
TV T 

Y,Y.^Hy^-tylt) (18) 



,i=l t=l 
N T 



N T 

E E (^».* {P^MT + ^^,t\T x^tir) H. 



=1 t=l 
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where tr( ) denotes the trace operator. Learning of general E, Q and R is discussed in Appendix B. 

Expressions for the transition and measurement process matrices that maximize log-hkelihood are derived to be: 

A = A2A'[^ where (19) 

TV T 

i=l t=l 
N T 

^2 = XI X(-^'.*'*-l|T^ + ^'-tlT ^lt-l\T) 
i=l t=l 

Vj=V2,jVi{3)-^ j^l,...,M where (20) 

N T 

i=l t=l 
N T 

Remembering that each yt^t is a subvector of M^^ corresponding to items observed at time t, the fill operator expands its 
argument back to M*^, with the observations in its appropriate positions and zeros elsewhere. We denote the jth rows of V 
and V2 as Vj and V2,j respectively, and \oi t as the indicator function that a rating is observed for user i and item j at time 
t. Derivations for these parameters are discussed in Appendix B. 

V. Empirical Results 

To vaUdate the effectiveness of Kalman learning compared to existing methods, we present results tested on generative data 
that follow a state space model. For this work, two main reasons led to our decision to use generative data rather than common 
datasets such as Netflix. First, a goal of the work is to understand how algorithms perform on preferences that evolved following 
a state space model. It is not clear that common datasets used in the recommendation systems literature match this model, and 
results would be too data-specific and not illuminating to the goal at hand. Second, a generative dataset gives insight on how 
the algorithms discussed perform in different parameter regimes, which is impossible in collected data. 

We generate the item factor matrix V iid A/'(0, Cy ) and the initial user factor matrix U{i)) iid A/'(0, cr^). Under the assumption 
that user factors do not change much with time, the stationary transition process matrix A is the weighted sum of the identity 
matrix and a random matrix, normalized so that the expected power of the state x^.t is constant in time. We note that A can be 
more general with similar results, but the normalization is important so that preference observations do not change scales over 
time. Finally, iid noise is added to both the transition and measurement processes as described in (|2]) and (O. The observation 
triplets {i,j,t) are uniformly drawn iid from all possibilities from the preference tensor. 

We present performance results for a particular choice of parameters in Fig.[Tl expressed in RMSE. Space limitations prevent 
us from presenting results for other parameter choices, but they are similar when the SNR is reasonable. For arbitrary initial 
guesses of the parameters, we find learning of variances and process matrices to converge and stabilize after about 10-20 EM 
iterations. As a result, state tracking is reliable and approaches the lower bound specified by the Kalman smoother output when 
the parameters, including the item factor matrix V, are known a priori. The estimate for the entire preference tensor O also 
performs well, meaning that CKF is a valid approach for recommendation systems with data following a state space model. 

In contrast, current algorithms such as SVD and timeSVD perform poorly on this dataset because they cannot handle general 
dynamics in user factors. Thus, the algorithm becomes confused and the estimates for the factor matrices tend to be close to 
zero, which is the best estimate when no data is observed. As shown in Fig. [2] the true trajectory of users may be that of an 
arc in factor space with additive perturbations. While CKF is able to track this evolution using smoothed and stable estimates, 
both SVD and timeSVD fail to capture this motion and hence have poor RMSE. SVD does not have temporal considerations 
and would give a stationary dot in the factor space. Meanwhile, timeSVD can only account for drift, meaning it can move 
in a linear fashion from a central point. In fact, this constraint leads to worse RMSE for most parameter choices than SVD 
because timeSVD overfits an incorrect model. 

VI. Conclusion 

In this paper, motivated by recommendation systems for consumption that arise in business analytics, we have proposed an 
extension to Gaussian PMF to take into account trajectories of user behavior. This has been done using a dynamical state space 
model from which predictions were made using the Kalman filter. We have derived an expectation-maximization algorithm to 
learn the parameters of the model from previously collected observations. We have validated the proposed CKF and shown its 
advantages over SVD and timeSVD on generated data. Future work underway includes testing and comparison on real-world 
collected data. 
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Fig. 1. For this testbencli, we set model dimensions to be {M, N, T, K) = (500, 500, 20, 5) and variances to be [cr^j , Uy , cr^ , cr^n) = (1, 1, 0.05, 0.1). 
The item factor dynamics are controlled by A, which is a weighted average of identity and a random dense matrix. The sampling factor is 0.005, meaning 
only 0.5% of the entries of the preference matrix are observed. For the generated data and crude initial guesses of the parameters, RMSE performance is 
given for estimation of (a) Kalman parameters learned via EM; (b) user factors/states; and (c) the preference matrix. We observe that EM learning is effective 
in estimating pai'ameters through noisy data, and this translates to better state tracking and estimation of the preference matrix. Convergence is fast and robust 
to initialization of parameters. 




- X - True 



Fig. 2. State-tracking abihty of CKF and timeSVD in thi'ee factor dimensions. The true user factors are well-tracked using CKF after parameters have been 
learned. However, timeSVD does not have flexibility to track general state evolutions and gives poor RMSE. 



In contrast to heuristic and limited prior methods that incorporate time dynamics in recommendation, the approach proposed 
in this paper is a principled formulation that can take advantage of decades of developments in tracking and algorithms for 
estimation. To break away from linearity assumptions, the extended or unscented Kalman filter can be used. Particle filtering 
can be used for non-Gaussian distributions, analogous to sampling-based inference in Bayesian PMF |fT3]| . 

Based on the state space model, we can also include a control signal in future work to control what items are recommended 
to users. There are a variety of reasons to include a control signal: certain items may have corporate sponsorship or are high 
revenue items, or certain media files may be cached at a wireless base station and it is inexpensive to serve those items. The 
proposed dynamic formulation can also be extended in a control-theoretic way to address the cold start problem: recommending 
items to users with no or little previous expressed preferences. 



Appendix A 
Useful Facts From Matrix Theory 

We present some useful facts for the below derivations 11411 : 
Fact 1. For x - 7V(/i, S), 

E[(x - m)^S-1(x - ^i)] = tr (S-i E[(x - m)(x - i^f]) 

Fact 2. 
Fact 3. 

dtr(X-'^A) , , ,,T 
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Fact 4. For square matrices A and B, 

dtr {AX^) 



dX 

Fact 5. For square matrices A and B, 



= A. 



dtr(AXBX'^C) ^ ^ ^ 



dX 



A' C XB' + CAXB. 



Appendix B 
Determining EM parameters 

We now derive the EM-parameter equations given in (fT6]l-(l20li. In the maximization step of the EM algorithm, we solve 
for parameters that maximize the expected joint likelihood: 



^("+1) = argmaxE 



(21) 



where 6^^' is the guess of the true parameter set on the nth iteration. It is common to consider the log-likelihood to change 
the products in the joint likelihood to summations; the maximizing parameters are the same for either optimization. The below 
derivations reference proofs in ||6l Chap. 13] and fTSl. 



A. Simplification of log-likelihood 

For a collaborative Kalman filter, the log-likelihood simplifies to 

N NT NT 

logL = \ogp{x,y;e) = ^\ogp{x.ifi) + ^^\ogp{xi^t\xi,t-i) + ^^^ogp(ii.i^t\xi,t), (22) 
1=1 1=1 t=i 1=1 t=i 



with 

p{xi^t\xi,t~i) ^ N'ixi^t; Ai^t Xi^t-i,Qi), 

piViA^i.t) ^ N'ivt.t; Hi^Xi^t.Ri)- (23) 

Using Fact[Tl the first term Li becomes 

E[ii] = ^ log(2^) - - ^ log - - ^ tr (Sri E[(x,,o - mO(x.,o - ■ (24) 

i=l i=l 

We then use the identity x^.o — IJ-i ~ (xi,o ^ x, o|t) + {^ifi\T ~ Mi) ™d note that estimation error and innovation of a Kalman 
filter are uncorrelated to rewrite the expectation of Li to be 

1 ^ 1 ^ 

E[ii] = ci - - ^ log |E,| - 2 E (^r' {P'mr + (*.,o|T - mO(x.,o|t - mO^)) ■ (25) 

■i=i 1=1 

We repeat a similar procedure for L2, this time using the identity 

— Ai^t Xi,t-1 = — Ai^t X,;,(_i + Xj_t|7- — Xi_t|T + Ai^t '^id-l\T ~ Ai,t ^iA-l\T 

= (xi,t|T - Ai,tXi,t-ilT) - (xi,tiT - Xi,t) + Ai^t (xi,t-i|T - Xi,t-i) . (26) 
We then rewrite the expectation of L2 as 

NT K T 1 . 

E[L2] = ^ log(27r) - - 5] log lO.I - 2 E E (^r' E[(x,,, - A:,*)(xm - A,,,)^]) 

1=1 i=l t=l 

« y AT T 

= C2 - E 9"^°gl*5»l ^ 9 EE*"^ {Qi^^ [{{^i,t\T - AiAi,t-l\T) - (xi,t|T -Xi,t) + A,,t (Xj,t-1|T -Xi,t_i)} 



2 ' 2 

4=1 1=1 t = l 



{ (xi,t|T - A,t^i,t-1\T) - (xi,t|T - ^i.t) + A.t (Xi,t-1|T - X^^t-i) } 



(27) 
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Expanding everything and again noting that the Kalman estimation error and innovation are uncorrected, ( |27] i simpHfies to 

N T 

I , — , I 

E[ 



1^ 

H.t 



1=1 4 = 1 t=l 

+A^,t (P,,t-1|T + x,,t_i|Tx^t_i|T) ^m}) • (28) 

A similar derivation is employed for L3 utilizing 

yi,t - Hi^t y^id = (yi,t - H^^t x,;,f|T) + H.,^t{^i,t\T - x,.*). (29) 

Some care is needed in writing out Ri in L3 since y^.t can be of different lengths depending on the observation tensor O 
and hence only a subset of the noise covariance matrix is required at each time step. To circumvent this issue, we define a 
fill function that expands the observation vector back to K.^^ and a diagonal binary matrix t S j^mxm ^j^j^ ^j^g^ jj^g 
diagonal positions where ratings are observed for user i at time t. 

Currently, the formulation is extremely general and parameters may change with users and in time. We can maximize with 
respect to the log-likelihood but the resulting estimation would be poor and does not exploit the possible similarities between 
a population of users. To fully realize the benefits of CKF, we make simplifying assumptions that /i; = 0, = E, = A, 
Qi = Q, and Ri = ct^Im- We now move summations into the trace operator and the log-likelihood simplifies to 

E[Li] = ci - y log lEl - i tr (E-^Fi) , (30) 
NT 1 

E[L2] = C2 - — log 101 - - tr(g-ir2), (31) 



E[L3]=C3-^log4-^tr(r3), (32) 



where 

N 

Ti = X] V '-MT + Xj^olT x^o|t) ' (33) 



1=1 

N T 

Ea - ^ 

=1 t=i 



iV 1 



(34) 



N T 



- X E fill(2/«,t)^ - 2 fill(yM) ^lt\T + (^M^) [P^AT + *m|t x^^it) • (35) 

1=1 t=i 

B. Determining S, Q fl«(i i? 

To maximize with respect to E, we can differentiate ( |30] |, set to zero, and solve. Using Facts |2] and |3] 

^^ = -y(S ) +-(E EiE ) , (36) 

and solving gives 

S = ^Ei. (37) 



If we had further assumed that E = erf, Ik, then (|30] | would simplify to 

E[ii] = ci - — log 4 _ _ tr (El) , 

and maximization yields (fTSl l. 

The derivations for Q and i? follow similarly and lead to ( fTTI i and (fTSl l respectively. 
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C. Determining A and V 
Rewriting (ISTT i as 



E[i2] =CA+ tr(g-M2^^) - \ tv{Q-^AA^A^), (38) 



where ca is the collection of terms that do not depend on A, we maximize using the same procedure as for E. We utilize 
Facts |4] and |5] while noting that Ai, A2 and Q are symmetric and invertible, and the maximization yields (fT9] l. 
Following a similar procedure for optimization of V, we express ( |32] | as 

1 1 / ^ ^ \ 

E[L3] = cv + — tr {V2V^) -i^triJ^Y. J^,tVV,,tV^ jf^ . (39) 

In this case, V cannot be expressed as a matrix product, but each row can. Noting Ji^t = Jft ™^ Ji,tJi,t = Ji,t, the 
maximization over each row yields (|20] |. 
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